How to Avoid Burning Ducks: Combining Linguistic Analysis and Corpus Statistics for German Compound Processing
Abstract
Compound splitting is an important problem in many NLP applications, and it must be solved to address data sparsity. Previous work has shown that linguistic approaches to German compound splitting produce a correct splitting more often, whereas corpus-driven approaches work best for phrase-based statistical machine translation from German to English, a worrisome contradiction. We address this situation by combining linguistic analysis with corpus-driven statistics, obtaining better results both in producing splittings that match a gold standard and in statistical machine translation performance.
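As a rough illustration of how linguistic analysis and corpus statistics can be combined, the sketch below ranks candidate segmentations by the geometric mean of their parts' corpus frequencies, a common corpus-driven score. The abstract does not specify the paper's scoring, so this is an assumption. The candidate analyses stand in for the output of a morphological analyzer, and the word list, counts, and function names are all hypothetical.

```python
from math import prod

# Hypothetical corpus counts; real counts would come from a large monolingual corpus.
CORPUS_FREQ = {
    "staubecken": 2,
    "stau": 500,
    "becken": 300,
    "staub": 400,
    "ecken": 50,
}

# Hypothetical analyzer output: linguistically valid candidate segmentations per word.
ANALYSES = {
    "staubecken": [["staubecken"], ["stau", "becken"], ["staub", "ecken"]],
}


def geometric_mean_score(parts, freq):
    """Geometric mean of the corpus counts of the parts (0 if any part is unseen)."""
    counts = [freq.get(p, 0) for p in parts]
    return prod(counts) ** (1.0 / len(counts))


def best_split(word, analyses=ANALYSES, freq=CORPUS_FREQ):
    """Pick the highest-scoring candidate; unknown words are left unsplit."""
    candidates = analyses.get(word, [[word]])
    return max(candidates, key=lambda parts: geometric_mean_score(parts, freq))


print(best_split("staubecken"))
```

With these made-up counts, the analyzer restricts the search to linguistically plausible splits, and the corpus score decides between the two competing readings ("Stau" + "Becken" versus "Staub" + "Ecken").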
Similar resources
ArchiMob - A Corpus of Spoken Swiss German
Swiss dialects of German are, unlike most dialects of well standardised languages, widely used in everyday communication. Despite this fact, automatic processing of Swiss German is still a considerable challenge due to the fact that it is mostly a spoken variety rarely recorded and that it is subject to considerable regional variation. This paper presents a freely available general-purpose corp...
Cultural Influence on the Expression of Cathartic Conceptualization in English and Spanish: A Corpus-Based Analysis
This paper investigates the conceptualization of emotional release from a cognitive linguistics perspective (Cognitive Metaphor Theory). The metaphor WEEPING IS A MEANS OF LIBERATING CONTAINED EMOTIONS is grounded in universal embodied cognition and is reflected in linguistic expressions in English and Spanish. Lexicalization patterns which encapsulate this conceptualization i...
Abstract Anaphors in German and English
Stefanie Dipper, Christine Rieger, Melanie Seiss, and Heike Zinsmeister (Ruhr-University Bochum, 44780 Bochum, Germany; University of Konstanz, 78457 Konstanz, Germany). Abstract anaphors refer to abstract referents such as facts or events. Automatic resolution of this kind of anaphora still poses a problem for language processing systems. The present pa...
An Overview of Corpus-Based Statistics-Oriented (CBSO) Techniques for Natural Language Processing
A Corpus-Based Statistics-Oriented (CBSO) methodology, which is an attempt to avoid the drawbacks of traditional rule-based approaches and purely statistical approaches, is introduced in this paper. Rule-based approaches, with rules induced by human experts, had been the dominant paradigm in the natural language processing community. Such approaches, however, suffer from serious difficulties in...